
@danbev danbev (Member) commented Sep 22, 2025

This PR consists of three commits but the first two are part of #16004.

This commit adds support for testing the ggml-cpu repack feature, which
enables repacking quantized data into a more optimal layout for
matrix multiplication on specific hardware architectures.

The motivation is to enable the testing of a cpu backend that uses
repacked data against a reference cpu backend that does not use repacked
data.

Building:

```console
$ cmake -B build \
    -DGGML_CPU_REF_BACKEND=ON \
    -DGGML_BACKEND_DL=ON \
    -DGGML_CPU_ALL_VARIANTS=ON
```

List available CPU architectures/variants:

```console
$ ./build/bin/test-backend-ops cpu-variants --list
CPU variants:
  CPU-haswell     - 12th Gen Intel(R) Core(TM) i7-1260P
  CPU-sse42       - 12th Gen Intel(R) Core(TM) i7-1260P
  CPU-x64         - 12th Gen Intel(R) Core(TM) i7-1260P
  CPU-alderlake   - 12th Gen Intel(R) Core(TM) i7-1260P
  CPU-sandybridge - 12th Gen Intel(R) Core(TM) i7-1260P
```

Run tests:

```console
./build/bin/test-backend-ops cpu-variants \
    --variant CPU-alderlake \
    -o "MUL_MAT(type_a=q4_0,type_b=f32,m=16,n=1,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1)"

Testing CPU variant 'CPU-alderlake' against cpu-ref backend...

repack: repack tensor a with q4_0_8x8
  MUL_MAT(type_a=q4_0,type_b=f32,m=16,n=1,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1): OK
repack: repack tensor a with q4_0_8x8
  MUL_MAT(type_a=q4_0,type_b=f32,m=16,n=1,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1): OK
  14491/14491 tests passed
```

All matrix multiplication tests can be run by specifying -o "MUL_MAT, MUL_MAT_ID".

@github-actions github-actions bot added testing Everything test related ggml changes relating to the ggml tensor library for machine learning labels Sep 22, 2025
@danbev danbev marked this pull request as ready for review September 24, 2025 08:37
@danbev danbev requested a review from slaren as a code owner September 24, 2025 08:37
@ggerganov (Member)

I don't think the reference CPU backend is using the generic implementations for some reason. Here is what I do on Mac:

```console
cmake .. -DGGML_CPU_REF_BACKEND=ON -DGGML_BACKEND_DL=ON -DGGML_BLAS=OFF -DGGML_METAL=OFF
make -j && ./bin/test-backend-ops cpu-variants --list
```

```console
load_backend: loaded CPU backend from llama.cpp/build-cpu-ref/bin/libggml-cpu.so
load_backend: loaded CPU backend from llama.cpp/build-cpu-ref/bin/libggml-cpu-ref.so
CPU variants:
  CPU             - Apple M4 Max
```

```console
make -j && ./bin/test-backend-ops cpu-variants --variant CPU -o MUL_MAT -p "q4_0"
```

If I put prints like this, I see only the Arm implementation being executed:

```diff
diff --git a/ggml/src/ggml-cpu/arch/arm/quants.c b/ggml/src/ggml-cpu/arch/arm/quants.c
index aadbb487e..fff646064 100644
--- a/ggml/src/ggml-cpu/arch/arm/quants.c
+++ b/ggml/src/ggml-cpu/arch/arm/quants.c
@@ -138,6 +138,7 @@ void quantize_row_q8_K(const float * GGML_RESTRICT x, void * GGML_RESTRICT y, in
 //===================================== Dot products =================================
 
 void ggml_vec_dot_q4_0_q8_0(int n, float * GGML_RESTRICT s, size_t bs, const void * GGML_RESTRICT vx, size_t bx, const void * GGML_RESTRICT vy, size_t by, int nrc) {
+    printf("arm\n");
     const int qk = QK8_0;
     const int nb = n / qk;
 
diff --git a/ggml/src/ggml-cpu/quants.c b/ggml/src/ggml-cpu/quants.c
index 365cb36d2..3694ae145 100644
--- a/ggml/src/ggml-cpu/quants.c
+++ b/ggml/src/ggml-cpu/quants.c
@@ -113,6 +113,7 @@ void quantize_row_q8_K_generic(const float * GGML_RESTRICT x, void * GGML_RESTRI
 //===================================== Dot products =================================
 
 void ggml_vec_dot_q4_0_q8_0_generic(int n, float * GGML_RESTRICT s, size_t bs, const void * GGML_RESTRICT vx, size_t bx, const void * GGML_RESTRICT vy, size_t by, int nrc) {
+    printf("generic\n");
     const int qk = QK8_0;
     const int nb = n / qk;
```

@danbev danbev (Member, Author) commented Sep 24, 2025

@ggerganov Thanks for trying this out. I'll take a closer look shortly 👍

With the fix in 22ef44d I was able to see both arm and generic output, though I did have to add the generic printf to ggml-quants.c:

```diff
diff --git a/ggml/src/ggml-cpu/arch/arm/quants.c b/ggml/src/ggml-cpu/arch/arm/quants.c
index aadbb487e..689148ae7 100644
--- a/ggml/src/ggml-cpu/arch/arm/quants.c
+++ b/ggml/src/ggml-cpu/arch/arm/quants.c
@@ -138,6 +138,7 @@ void quantize_row_q8_K(const float * GGML_RESTRICT x, void * GGML_RESTRICT y, in
 //===================================== Dot products =================================
 
 void ggml_vec_dot_q4_0_q8_0(int n, float * GGML_RESTRICT s, size_t bs, const void * GGML_RESTRICT vx, size_t bx, const void * GGML_RESTRICT vy, size_t by, int nrc) {
+    printf("arm...\n");
     const int qk = QK8_0;
     const int nb = n / qk;
 
diff --git a/ggml/src/ggml-quants.c b/ggml/src/ggml-quants.c
index 727932123..058b0f6a6 100644
--- a/ggml/src/ggml-quants.c
+++ b/ggml/src/ggml-quants.c
@@ -197,6 +197,7 @@ void quantize_row_q5_1_ref(const float * GGML_RESTRICT x, block_q5_1 * GGML_REST
 
 // reference implementation for deterministic creation of model files
 void quantize_row_q8_0_ref(const float * GGML_RESTRICT x, block_q8_0 * GGML_RESTRICT y, int64_t k) {
+    printf("generic....\n");
     assert(k % QK8_0 == 0);
     const int nb = k / QK8_0;
```

I've also stepped through this in the debugger:

```console
(lldb) target create "build/bin/test-backend-ops"
Current executable set to '/Users/danbev/work/ai/llama.cpp/build/bin/test-backend-ops' (arm64).
(lldb) settings set -- target.run-args  "cpu-variants" "--variant" "CPU" "-o" "MUL_MAT(type_a=q4_0,type_b=f32,m=16,n=16,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1)"
(lldb) br set -f ggml-backend.cpp -l 2035
Breakpoint 1: where = libggml-base.dylib`::ggml_backend_compare_graph_backend(ggml_backend_t, ggml_backend_t, ggml_cgraph *, ggml_backend_eval_callback, void *, ggml_tensor *) + 668 at ggml-backend.cpp:2035:40, address = 0x0000000000022734
(lldb) r
Process 58647 launched: '/Users/danbev/work/ai/llama.cpp/build/bin/test-backend-ops' (arm64)
ggml_backend_load_best: failed to find ggml_backend_score in /Users/danbev/work/ai/llama.cpp/build/bin/libggml-cpu-ref.so
load_backend: loaded CPU backend from /Users/danbev/work/ai/llama.cpp/build/bin/libggml-cpu.so
register_backend: registered backend CPU (1 devices)
register_device: registered device CPU (Apple M3)
load_backend: loaded CPU backend from /Users/danbev/work/ai/llama.cpp/build/bin/libggml-cpu-ref.so
register_backend: registered backend CPU (1 devices)
register_device: registered device CPU-ref (Apple M3)
Testing CPU variant 'CPU' against cpu-ref backend...

Process 58647 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = breakpoint 1.1
    frame #0: 0x000000010041a734 libggml-base.dylib`ggml_backend_compare_graph_backend(backend1=0x00006000014a8090, backend2=0x00006000014a8000, graph=0x0000000100224020, callback=(test-backend-ops`test_case::eval(ggml_backend*, ggml_backend*, char const*, printer*)::'lambda'(int, ggml_tensor*, ggml_tensor*, void*)::__invoke(int, ggml_tensor*, ggml_tensor*, void*) at test-backend-ops.cpp:1204), user_data=0x000000016fdfddb8, test_node=0x0000000000000000) at ggml-backend.cpp:2035:40
   2032	            struct ggml_cgraph g1v = ggml_graph_view(g1, i, i + 1);
   2033	            struct ggml_cgraph g2v = ggml_graph_view(g2, i, i + 1);
   2034
-> 2035	            ggml_backend_graph_compute(backend1, &g1v);
   2036	            ggml_backend_graph_compute(backend2, &g2v);
   2037
   2038	            if (ggml_is_view_op(t1->op)) {
Target 0: (test-backend-ops) stopped.
(lldb) n
arm...
arm...
arm...
[... "arm..." printed 64 times in total ...]
Process 58647 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = step over
    frame #0: 0x000000010041a740 libggml-base.dylib`ggml_backend_compare_graph_backend(...) at ggml-backend.cpp:2036:40
   2033	            struct ggml_cgraph g2v = ggml_graph_view(g2, i, i + 1);
   2034
   2035	            ggml_backend_graph_compute(backend1, &g1v);
-> 2036	            ggml_backend_graph_compute(backend2, &g2v);
   2037
   2038	            if (ggml_is_view_op(t1->op)) {
   2039	                continue;
Target 0: (test-backend-ops) stopped.
(lldb) n
generic....
generic....
generic....
[... "generic...." printed 64 times in total ...]
Process 58647 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = step over
    frame #0: 0x000000010041a748 libggml-base.dylib`ggml_backend_compare_graph_backend(...) at ggml-backend.cpp:2038:33
   2035	            ggml_backend_graph_compute(backend1, &g1v);
   2036	            ggml_backend_graph_compute(backend2, &g2v);
   2037
-> 2038	            if (ggml_is_view_op(t1->op)) {
   2039	                continue;
   2040	            }
   2041
Target 0: (test-backend-ops) stopped.
```

I realized that this was because I had not disabled GGML_LLAMAFILE when building. Setting it to OFF then printed the output like you added. I'm looking into disabling this in the build now. I've now disabled LLAMAFILE, HBM, OpenMP, and KleidiAI for the CPU reference backend.
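For reference, the equivalent manual configuration would look something like this (a sketch: these GGML_* options do exist in ggml's CMake, but the exact set baked into the cpu-ref target may differ):

```console
$ cmake -B build \
    -DGGML_CPU_REF_BACKEND=ON \
    -DGGML_BACKEND_DL=ON \
    -DGGML_LLAMAFILE=OFF \
    -DGGML_CPU_HBM=OFF \
    -DGGML_OPENMP=OFF \
    -DGGML_CPU_KLEIDIAI=OFF
```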

Comment on lines 1992 to 1994:

```cpp
// A problem arises in `ggml_backend_compare_graph_backend` where the graph
// is copied, and ggml_backend_buffer_repack_buffer_type does not implement
// the get_tensor function, but even if it did it would return the repacked data
```
Member

The solution for this would be to use the CPU reference implementation as the main backend to copy data from, rather than the tested backend. Basically, the order of the backends in test-backend-ops should be swapped. This would also solve a problem where if the backend has a buggy buffer implementation, it can lead to corrupted tensor data when the data is copied from the original backend to the CPU backend. This caused issues before when testing some backends.
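A minimal sketch of what the swap could look like at the call site, reusing the ggml_backend_compare_graph_backend signature visible in the debugger session above (the helper name and surrounding test-backend-ops code are illustrative):

```cpp
// Sketch: the reference backend goes first, so its buffers hold the original
// tensor data and the tested variant receives the copies, not the other way around.
static bool compare_against_ref(ggml_backend_t backend_ref,      // cpu-ref backend (copy source)
                                ggml_backend_t backend_variant,  // CPU variant under test
                                struct ggml_cgraph * graph,
                                ggml_backend_eval_callback callback, void * user_data) {
    // previously: ggml_backend_compare_graph_backend(backend_variant, backend_ref, ...)
    return ggml_backend_compare_graph_backend(backend_ref, backend_variant,
                                              graph, callback, user_data, /*test_node=*/NULL);
}
```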

Member Author

Oh nice, I'll give that a try, thanks!

@danbev danbev (Member, Author) commented Sep 25, 2025

@slaren Thanks for your reviews!
I've tried to apply them. I've changed the order of the backends, which works, but I'm trying to figure out how I can get the extra buffer types (repack) to work with this solution. I'll continue looking into it, but wanted to let you know so you won't be surprised if you try it out.

@slaren slaren (Member) left a comment

Strictly speaking, this is not enough to ensure that only the C implementation is used in the CPU-ref backend, because the base ABI may include some vector instructions. For example, the Linux AMD64 ABI includes SSE2, and if you add a feature flag for this instruction set, it will show up in the CPU-ref backend:

```diff
diff --git a/ggml/include/ggml-cpu.h b/ggml/include/ggml-cpu.h
index 9edd48513..004acea31 100644
--- a/ggml/include/ggml-cpu.h
+++ b/ggml/include/ggml-cpu.h
@@ -75,6 +75,7 @@ extern "C" {
     //
 
     // x86
+    GGML_BACKEND_API int ggml_cpu_has_sse2       (void);
     GGML_BACKEND_API int ggml_cpu_has_sse3       (void);
     GGML_BACKEND_API int ggml_cpu_has_ssse3      (void);
     GGML_BACKEND_API int ggml_cpu_has_avx        (void);
diff --git a/ggml/src/ggml-cpu/ggml-cpu.c b/ggml/src/ggml-cpu/ggml-cpu.c
index c13129084..06c8f54df 100644
--- a/ggml/src/ggml-cpu/ggml-cpu.c
+++ b/ggml/src/ggml-cpu/ggml-cpu.c
@@ -3428,6 +3428,14 @@ int ggml_cpu_has_llamafile(void) {
 #endif
 }
 
+int ggml_cpu_has_sse2(void) {
+#if defined(__SSE2__)
+    return 1;
+#else
+    return 0;
+#endif
+}
+
 int ggml_cpu_has_sse3(void) {
 #if defined(__SSE3__)
     return 1;
diff --git a/ggml/src/ggml-cpu/ggml-cpu.cpp b/ggml/src/ggml-cpu/ggml-cpu.cpp
index 90dd65c41..6f42b0a8f 100644
--- a/ggml/src/ggml-cpu/ggml-cpu.cpp
+++ b/ggml/src/ggml-cpu/ggml-cpu.cpp
@@ -506,6 +506,9 @@ static ggml_backend_feature * ggml_backend_cpu_get_features(ggml_backend_reg_t r
         ggml_cpu_init();
 
         std::vector<ggml_backend_feature> features;
+        if (ggml_cpu_has_sse2()) {
+            features.push_back({ "SSE2", "1" });
+        }
         if (ggml_cpu_has_sse3()) {
             features.push_back({ "SSE3", "1" });
         }
```

Other than that, I think this could be good enough for a first version. To be useful as part of a CI test however, test-backend-ops will need to be modified to be able to load every CPU backend variant supported by the system, not just the one selected by ggml_backend_load_best.
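A rough sketch of what that loading could look like, assuming the variant libraries sit next to the executable and using the public ggml_backend_load; the discovery and naming details are assumptions:

```cpp
#include <filesystem>
#include <string>
#include <vector>

#include "ggml-backend.h"

// Load every CPU backend variant library found in dir, instead of only the one
// ggml_backend_load_best would pick. Returns the successfully loaded registries.
// Note: in practice each library's ggml_backend_score would also need to be
// checked so that variants the host CPU cannot run are skipped.
static std::vector<ggml_backend_reg_t> load_all_cpu_variants(const std::filesystem::path & dir) {
    std::vector<ggml_backend_reg_t> regs;
    for (const auto & entry : std::filesystem::directory_iterator(dir)) {
        const std::string name = entry.path().filename().string();
        // e.g. libggml-cpu-haswell.so, libggml-cpu-sse42.so, ... (naming assumed)
        if (name.find("ggml-cpu") == std::string::npos) {
            continue;
        }
        if (ggml_backend_reg_t reg = ggml_backend_load(entry.path().string().c_str())) {
            regs.push_back(reg);
        }
    }
    return regs;
}
```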

@danbev danbev (Member, Author) commented Sep 26, 2025

> Other than that, I think this could be good enough for a first version.

Do you have any suggestions on how to also avoid compiler optimizations due to the availability of SSE2, or is it perhaps enough that the cpu-ref backend does not use any explicit SIMD operations?

@slaren slaren (Member) commented Sep 26, 2025

Compiler optimizations are not important, we are trying to test our code, not the compiler. The calling convention of AMD64 may actually require using SSE registers, so it may not be possible to disable SSE completely. It may be possible to do something like this:

```c
#ifdef GGML_CPU_GENERIC
#undef __SSE2__
#endif
```

@danbev danbev (Member, Author) commented Sep 26, 2025

@slaren When you get a chance would you be able to help me with the repacking?
In caa91b5 I've been able to force this to work, but I'm not sure what the best way to do this is.

@slaren slaren (Member) commented Sep 29, 2025

I don't have any specific suggestions, I think that it will take some iteration to reach a good implementation. I would consider moving the code of ggml_backend_graph_copy to test-backend-ops so that it is easier to modify it for its needs. It may also be convenient to add a new API function (obtained with ggml_backend_reg_get_proc_address) to get an explicit list of operations and types that an extra buffer type supports, because at the moment there isn't a good way to obtain this information.
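Something along these lines, purely as a sketch of the shape such an API could take (none of these names exist today; only ggml_backend_reg_get_proc_address is real):

```cpp
// Hypothetical: an extra buffer type reports the (op, type) pairs it accelerates.
struct ggml_backend_extra_buft_op {
    enum ggml_op   op;    // e.g. GGML_OP_MUL_MAT
    enum ggml_type type;  // e.g. GGML_TYPE_Q4_0
};

typedef const struct ggml_backend_extra_buft_op * (*ggml_backend_extra_buft_get_ops_t)(
    ggml_backend_buffer_type_t buft, size_t * n_ops);

// test-backend-ops would then retrieve it through the existing proc-address mechanism:
//   auto get_ops = (ggml_backend_extra_buft_get_ops_t)
//       ggml_backend_reg_get_proc_address(reg, "ggml_backend_extra_buft_get_ops");
//   if (get_ops != NULL) {
//       size_t n = 0;
//       const struct ggml_backend_extra_buft_op * ops = get_ops(buft, &n);
//       // enable only the test cases covered by ops[0..n)
//   }
```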

@ggerganov ggerganov (Member) commented Oct 1, 2025

I think it would be useful to demonstrate how to apply this functionality to detect a few known issues with the CPU backend that we are aware of:

Currently these problems go undetected and I think it would be useful to know how to apply the new CPU reference backend and the functionality in this PR to find these. Based on this, we can think if it would make sense to create some CI workflows to automate this (though we have to be mindful about the performance and how often to run those).

Since this involves running things on hardware that is hard to access (SVE, AMX, etc.), we can probably create a separate workflow just for this and experiment with it to avoid flooding the regular CI with extra jobs?

@danbev danbev (Member, Author) commented Oct 1, 2025

> Since this involves running things on hardware that is hard to access (SVE, AMX, etc.), we can probably create a separate workflow just for this and experiment with it to avoid flooding the regular CI with extra jobs?

I'll take a look at this as soon as I've got the repack part working 👍

Example of a manually triggered run testing CPU-armv8.2_2 against the CPU-ref backend for MUL_MAT

@danbev danbev force-pushed the test-backend-ops-repack branch from 9dfbd50 to 50c1e6e Compare October 2, 2025 05:32
@danbev danbev requested a review from CISC as a code owner October 2, 2025 12:18
@danbev danbev marked this pull request as draft October 2, 2025 12:18
@danbev danbev removed the request for review from CISC October 2, 2025 12:18
@github-actions github-actions bot added the devops improvements to build systems and github actions label Oct 2, 2025
@danbev danbev force-pushed the test-backend-ops-repack branch 2 times, most recently from 0a421c0 to 2b770f0 Compare October 3, 2025 11:13
@netrunnereve (Collaborator)

> Currently these problems go undetected and I think it would be useful to know how to apply the new CPU reference backend and the functionality in this PR to find these. Based on this, we can think if it would make sense to create some CI workflows to automate this (though we have to be mindful about the performance and how often to run those).

This can actually be done right now by running test-backend-ops against the Vulkan backend with llvmpipe. Obviously it would be way faster if we could directly compare two CPU implementations though.
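For reference, that comparison would look roughly like this (a sketch: it assumes a Vulkan build on a machine where llvmpipe is the active Vulkan device, and the device name reported by test-backend-ops may differ):

```console
$ cmake -B build -DGGML_VULKAN=ON && cmake --build build -j
$ ./build/bin/test-backend-ops test -b Vulkan0 -o MUL_MAT
```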

danbev added 5 commits October 7, 2025 06:20
This commit introduces a CPU reference implementation for GGML,
designed primarily for testing and validation purposes.

The motivation for this addition is to have a pure C CPU backend
implementation that does not use any hardware-specific optimizations
or intrinsics. This will allow for testing the CPU backend variants
against the reference implementation to ensure correctness.
This commit adds support for testing the ggml-cpu repack feature, which
enables repacking quantized data into a more optimal layout for
matrix multiplication on specific hardware architectures.

The motivation is to enable the testing of a cpu backend that uses
repacked data against a reference cpu backend that does not use repacked
data.

Building:
```console
$ cmake -B build \
    -DGGML_CPU_REF_BACKEND=ON \
    -DGGML_BACKEND_DL=ON \
    -DGGML_CPU_ALL_VARIANTS=ON
```

List available CPU architectures/variants:
```console
$ ./build/bin/test-backend-ops cpu-variants --list
CPU variants:
  CPU-alderlake   - 12th Gen Intel(R) Core(TM) i7-1260P
```
Run tests:
```console
./build-ref/bin/test-backend-ops cpu-variants \
    --variant CPU-alderlake \
    -o "MUL_MAT(type_a=q4_0,type_b=f32,m=16,n=1,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1)"

Testing CPU variant 'CPU-alderlake' against cpu-ref backend...

repack: repack tensor a with q4_0_8x8
  MUL_MAT(type_a=q4_0,type_b=f32,m=16,n=1,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1): OK
repack: repack tensor a with q4_0_8x8
  MUL_MAT(type_a=q4_0,type_b=f32,m=16,n=1,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1): OK
  14491/14491 tests passed
```
All matrix multiplication tests can be run by specifying
`-o "MUL_MAT"`, but it may be harder to spot the ones that use repacking.
This commit adds ggml.c to the test-backend-ops compilation unit, as on
Windows the linker cannot resolve some of the functions that have been
moved into test-backend-ops, like ggml_hash_set_new. I'm not sure we
will keep this change, but for now it will at least allow the test to
be run on Windows as well.
danbev added 2 commits October 7, 2025 06:20
This is a work in progress to try out different CPU variants using the
CPU reference backend with repacking enabled. The workflow will be
manually triggered and allows specifying the operation and variant
to test.
@danbev danbev force-pushed the test-backend-ops-repack branch from 2b770f0 to fe21170 Compare October 7, 2025 04:23
@danbev danbev (Member, Author) commented Oct 7, 2025

@slaren I've tried moving ggml_backend_graph_copy to test-backend-ops.cpp and I've got this to work. When you get a chance please take a look and see what you think.

@ggerganov Regarding the testing, I tried to use the cpu-ref backend to find the SVE issue but it did not uncover it; I did, however, use the test-cpu-variants workflow for that issue, which was useful just the same. This is currently a manual workflow with only a single job at this point. Just running the job without specifying a variant will list all the variants available, and a variant can be specified when running the workflow. It also supports disabling the repack backend, as an example of options we could add. I'm not sure if this is what you had in mind, but perhaps it can be a base for discussion?

I've kept the tmate action in the test-cpu-variants workflow for now; it is an external action (non ggml-org), but I find it very useful to be able to SSH into the runner and run builds/tests manually.

@ggerganov ggerganov (Member) left a comment

> @ggerganov Regarding the testing I tried to use the cpu ref backend to find the SVE issue but it did not uncover it, but I did use the test-cpu-variants workflow for that issue which was useful just the same. This is currently a manual workflow with only a single job at this point. Just running the job without specifying a variant will list all the variants available, and a variant can be specified when running the workflow. It also supports disabling the repack backend as an example of options we could add. I'm not sure if you had something like this in mind for this but perhaps this can be a base for discussion?

I was thinking mainly about the following use case: if for example there is a bug in the ARM version of ggml_vec_dot_q4_0_q8_0() it would be nice to be able to detect it by comparing it to the ggml_vec_dot_q4_0_q8_0_generic() implementation.
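A unit-level comparison along these lines could catch that directly (a sketch only: it assumes access to ggml's internal headers declaring the quantization helpers and both vec_dot symbols, the header paths are assumptions, and the _generic symbol is only a distinct function when an arch-specific implementation is compiled in):

```cpp
#include <cmath>
#include <cstdio>
#include <cstdlib>

#include "ggml-quants.h"      // quantize_row_q4_0_ref, quantize_row_q8_0_ref (internal header)
#include "ggml-cpu/quants.h"  // assumed location of the vec_dot declarations

int main(void) {
    const int k = 256; // must be a multiple of the block size (32)
    float x[256], y[256];
    for (int i = 0; i < k; i++) {
        x[i] = (float) rand() / RAND_MAX - 0.5f;
        y[i] = (float) rand() / RAND_MAX - 0.5f;
    }

    // quantize the same random data once, then feed it to both implementations
    block_q4_0 a[256 / 32];
    block_q8_0 b[256 / 32];
    quantize_row_q4_0_ref(x, a, k);
    quantize_row_q8_0_ref(y, b, k);

    float s_arch = 0.0f;
    float s_gen  = 0.0f;
    // the stride arguments are unused when nrc == 1, hence the zeros
    ggml_vec_dot_q4_0_q8_0        (k, &s_arch, 0, a, 0, b, 0, 1);
    ggml_vec_dot_q4_0_q8_0_generic(k, &s_gen,  0, a, 0, b, 0, 1);

    printf("arch: %f generic: %f diff: %g\n", s_arch, s_gen, std::fabs(s_arch - s_gen));
    return std::fabs(s_arch - s_gen) > 1e-3f ? 1 : 0;
}
```

Run over many random inputs and lengths, something this small would be cheap enough to fuzz every (arch, generic) vec_dot pair.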

Comment on lines 1872 to 1910:

```cpp
static enum ggml_status ggml_backend_cpu_repack_buffer_init_tensor(ggml_backend_buffer_t buffer, struct ggml_tensor * tensor) {
    tensor->extra = (void *) const_cast<ggml::cpu::tensor_traits *>(ggml_repack_get_optimal_repack_type(tensor));
    if (tensor->op == GGML_OP_NONE) {
        tensor->extra = (void *) const_cast<ggml::cpu::tensor_traits *>(ggml_repack_get_optimal_repack_type(tensor));
        tensor->buffer = buffer;
    }

    if (supports_tensor(tensor)) {
        tensor->src[0]->extra = (void *) const_cast<ggml::cpu::tensor_traits *>(ggml_repack_get_optimal_repack_type(tensor->src[0]));
        tensor->src[0]->buffer = buffer;
    }
```
Member

I'm not sure I understand the logic for this change - would need some clarification.

@danbev danbev (Member, Author) Oct 7, 2025

In test-backend-ops the CPU ref backend has its tensors copied (the order is switched, see comment above for details) to the backend variant. Then ggml_backend_graph_copy, which has been copied into test-backend-ops, will call ggml_backend_buffer_init_tensor, which is how we end up in this function when running the tests.

Before this change the passed-in tensor would always get its extra field set: either to null, or to a trait if the tensor's type is supported and the underlying backend also supports it.

There are tests in test-backend-ops that use broadcasting, for example setting nr to either [2, 1] or [1, 2], and when using repack this causes these specific tests to fail because repack does not support broadcasting as far as I can tell. supports_tensor checks the dimensions of src[1], in addition to some other checks, to avoid this situation.

The setting of the buffer is really only required for the tests. When we copy the CPU ref backend's tensors they have the CPU buffer as their backend buffer, but we need this to be the repack buffer so that repack's get_tensor_traits works and returns the trait. Otherwise ggml_cpu_extra_compute_forward, called from ggml_compute_forward, will not forward the call to repack.

I hope I've been able to explain the motivation for this but please let me know if that is not the case.

Member

This is definitely not ok, you can't change the buffer of a tensor during init_tensor, and you absolutely cannot change the buffer of a source tensor.
test-backend-ops needs to be modified to use the extra buffer types in the way they are intended - that is, only allocating the weights on the extra buffer type, and everything else on the default buffer type. This will likely mean either creating a new type of test, or making deep changes to the way tests are defined. Hacking the backend code to make it work is not ok.

Member Author

I've removed this logic from repack.cpp. I've just moved the code to test-backend-ops for now to enable the test to pass, so I have something that works. I'll take another stab at this tomorrow.

Member Author

I've made another attempt at this in 8c3eb6c. Is this more in line with what you had in mind @slaren?
Let me know if I can rebase this PR to make reviewing easier.
